Could we use machine learning predictive models to obtain unbiased causal effect estimates with confidence intervals?
BUT ML models are good for prediction (not causal inference)
What is the effect of a variable \(D\) (e.g. having a college degree) on some response \(Y\) (e.g. income)?
We consider the partially linear model
\[ \begin{align}\begin{aligned}y = \tau \, D + g(X) + \epsilon, & &\epsilon \sim \mathcal{N}(0,1),\\ D = m(X) + v, & &v \sim \mathcal{N}(0,1),\end{aligned}\end{align} \]
Furthermore, the confounding is assumed to be complex and/or high-dimensional.
Goal: estimate the causal parameter \(\tau\)
To estimate \(\tau\) in
\[y = \tau \, D + g(X) + \epsilon,\]
a naive approach fits \(g\) with a machine learning method and regresses \(y - \hat{g}(X)\) on \(D\). The resulting estimate suffers from regularization bias. See Chernozhukov et al. (2018).
Understanding regularization bias
The naive estimator performs poorly because regularization keeps only controls that are strong predictors of the outcome, dropping the weak ones. A weak predictor of the outcome, however, can still be a strong predictor of \(D\), and dropping such a control produces a strong omitted-variable bias.
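This mechanism can be seen in a small simulation. The data-generating process, coefficient values, and penalty level below are illustrative assumptions (true effect \(\tau = 0.5\)): the control \(x\) affects the outcome only weakly but drives the treatment strongly, so the Lasso penalty shrinks it away and its explanatory power loads onto \(d\).

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Hypothetical DGP: x is a WEAK predictor of y (coef 0.2)
# but a STRONG predictor of the treatment d (coef 2.0).
rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
d = 2.0 * x + rng.normal(size=n)
y = 0.5 * d + 0.2 * x + rng.normal(size=n)  # true tau = 0.5

Z = np.column_stack([d, x])
naive = Lasso(alpha=0.1).fit(Z, y)   # penalization drops the weak control x
ols = LinearRegression().fit(Z, y)   # unpenalized benchmark

print("lasso tau_hat:", naive.coef_[0])  # biased upward: x's effect loads on d
print("ols   tau_hat:", ols.coef_[0])    # close to the true 0.5
```

The unpenalized regression recovers \(\tau\); the penalized one does not, even though its predictions of \(y\) are nearly as good.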
\[y=\beta_0+\beta_1 x_1+\beta_2 x_2 + \beta_3 x_3+\epsilon\]
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
# Generate synthetic data
np.random.seed(42) # For reproducibility
n_samples = 1000
X1 = np.random.normal(0, 1, n_samples)
X2 = np.random.normal(0, 0.5, n_samples)
X3 = np.random.normal(0, 0.5, n_samples)
# Generate outcome Y as a linear combination of X1, X2 and X3
Y = 2 * X1 + 5 * X2 + 0.5 * X3 + np.random.normal(0, 1, n_samples)
# Create a pandas DataFrame
data = pd.DataFrame({'X1': X1, 'X2': X2, 'X3': X3, 'Y': Y})
# Multivariate regression
reg = sm.OLS(Y, sm.add_constant(data[['X1','X2','X3']])).fit()
print(reg.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.900
Model: OLS Adj. R-squared: 0.899
Method: Least Squares F-statistic: 2972.
Date: Tue, 01 Apr 2025 Prob (F-statistic): 0.00
Time: 12:29:31 Log-Likelihood: -1443.4
No. Observations: 1000 AIC: 2895.
Df Residuals: 996 BIC: 2914.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -0.0145 0.033 -0.445 0.656 -0.078 0.049
X1 1.9832 0.033 59.716 0.000 1.918 2.048
X2 4.8865 0.065 74.953 0.000 4.759 5.014
X3 0.5445 0.066 8.240 0.000 0.415 0.674
==============================================================================
Omnibus: 2.695 Durbin-Watson: 2.037
Prob(Omnibus): 0.260 Jarque-Bera (JB): 2.349
Skew: 0.000 Prob(JB): 0.309
Kurtosis: 2.763 Cond. No. 2.05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\[y-E[y \mid x_2,x_3]= \beta_0 + \beta_1 \,(x_1 - E[x_1 \mid x_2,x_3]) + \epsilon\]
# Frisch-Waugh-Lovell with linear first-stage regressions
# First-stage regressions (linear)
reg_X1_X2X3 = LinearRegression().fit(data[['X2', 'X3']], data['X1'])
X1_resid = data['X1'] - reg_X1_X2X3.predict(data[['X2', 'X3']])
reg_Y_X2X3 = LinearRegression().fit(data[['X2', 'X3']], data['Y'])
Y_resid = data['Y'] - reg_Y_X2X3.predict(data[['X2', 'X3']])
# Second-stage regression (linear)
reg_resid_resid = sm.OLS(Y_resid, sm.add_constant(X1_resid)).fit()
# Summary
print(reg_resid_resid.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Y R-squared: 0.782
Model: OLS Adj. R-squared: 0.781
Method: Least Squares F-statistic: 3573.
Date: Tue, 01 Apr 2025 Prob (F-statistic): 0.00
Time: 12:29:31 Log-Likelihood: -1443.4
No. Observations: 1000 AIC: 2891.
Df Residuals: 998 BIC: 2901.
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 8.674e-18 0.032 2.67e-16 1.000 -0.064 0.064
X1 1.9832 0.033 59.776 0.000 1.918 2.048
==============================================================================
Omnibus: 2.695 Durbin-Watson: 2.037
Prob(Omnibus): 0.260 Jarque-Bera (JB): 2.349
Skew: 0.000 Prob(JB): 0.309
Kurtosis: 2.763 Cond. No. 1.02
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Let \(y\) be the outcome and \(x_1\) the variable of interest. By the Frisch-Waugh-Lovell theorem, the coefficient of \(x_1\) in the full regression can be derived from a two-stage process:
\[y-E[y|x_2,x_3]= \beta_0 + \beta_1 (x_1 - E[x_1|x_2,x_3]) + \epsilon\] \[\tilde{y}=\beta_0+\beta_1 \tilde{x}_1+\epsilon\]
Back to the partially linear model
\[ \begin{align}\begin{aligned}y = \tau \, D + g(X) + \epsilon, & &\epsilon \sim \mathcal{N}(0,1),\\ D = m(X) + v, & &v \sim \mathcal{N}(0,1),\end{aligned}\end{align} \]
\[y-\hat{g}_y(X) = \tau \, \bigl(D-\hat{m}_D(X)\bigr)+\epsilon\]
where \(\hat{g}_y(X)\) and \(\hat{m}_D(X)\) are machine learning estimates of \(E[y \mid X]\) and \(E[D \mid X]\).
Using machine learning, one gains flexibility: learners can capture complicated functional forms in the nuisance relationships. But that flexibility also introduces the possibility of overfitting.
Residuals are therefore generated using out-of-fold predictions.
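In scikit-learn, out-of-fold predictions can be obtained with `cross_val_predict`. A minimal sketch on assumed toy data (the DGP and learner choice are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Toy data, assumed for illustration only
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=500)

learner = RandomForestRegressor(n_estimators=100, random_state=0)

# Each observation is predicted by the model trained on the folds
# that do NOT contain it, so the residuals are free of
# same-sample overfitting.
y_oof = cross_val_predict(learner, X, y, cv=5)
oof_resid = y - y_oof

# In-sample residuals, by contrast, are deceptively small
# for a flexible learner.
in_resid = y - learner.fit(X, y).predict(X)
print(np.var(in_resid), np.var(oof_resid))
```

The in-sample residual variance is much smaller than the out-of-fold one, which is exactly the overfitting that would contaminate a naive residual-on-residual regression.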
Naive inference based on a direct application of machine learning methods to estimate the causal parameter is generally invalid. There are two distinct sources of bias: regularization bias and overfitting bias.
Double machine learning embeds two mechanisms to avoid these biases:
Use machine learning algorithms to obtain flexible estimates of the conditional means \(g\) and \(m\)
Use sample splitting (k-fold cross-fitting) for predicting conditional means and computing residuals (for outcome and treatment)
Run a residual on residual regression to retrieve the causal effect
Main result for the DoubleML estimator: unbiasedness and approximate normality
Simulation results show that the quality of the causal estimates depends on the predictive performance of the nuisance learners: the choice of learners and tuning parameters is crucial.
Recommended diagnostics: